Expected weighted on-base average (xWOBA) is the response in this project. xWOBA is designed to gauge a player's average offensive contribution per plate appearance and is calculated from the exit velocity and launch angle of the ball following contact. Here, we try to predict xWOBA using purely pitching data. The goal is twofold: to remove a degree of noise from the evaluation of pitching performance and value, and to perform inference on which pitching variables matter most in determining a pitcher's xWOBA against the batters they face.

Data Exploration

Here, we examine which predictors display a clear relationship with xWOBA in order to inform the predictors we will eventually fit our model on.

Pitch Speed

First, we looked at our two speed-related predictors: release speed and effective speed (which factors in release position in addition to release speed). Neither appears to be highly correlated with xWOBA; the correlations are roughly 0.02 and 0.04, respectively.

This is somewhat unsurprising. While one might intuitively think that faster pitches are harder to hit, with proper timing a batter can fairly easily achieve hard contact on a poorly placed fastball. These trends suggest that the location at which the pitch crosses home plate might be more important to xWOBA than pure velocity-related statistics.

ggplot(final_pitch_data, aes( x=release_speed,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$release_speed,final_pitch_data$xwoba)
## [1] 0.02418302
ggplot(final_pitch_data, aes( x=effective_speed,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$effective_speed,final_pitch_data$xwoba)
## [1] 0.0372182
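
To judge whether correlations this small are distinguishable from zero, a confidence interval can be more informative than the point estimate alone. The sketch below uses `cor.test` from base R on simulated data; the data frame `toy_pitches` and all of its values are made up for illustration and are not drawn from `final_pitch_data`.

```r
# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(1)
n <- 20000
toy_pitches <- data.frame(release_speed = rnorm(n, mean = 93, sd = 3))
# Response with a deliberately weak linear dependence on speed
toy_pitches$xwoba <- 0.3 + 0.002 * (toy_pitches$release_speed - 93) +
  rnorm(n, sd = 0.3)

# Pearson correlation together with a 95% confidence interval and p-value
speed_test <- cor.test(toy_pitches$release_speed, toy_pitches$xwoba)
speed_test$estimate   # point estimate of the correlation
speed_test$conf.int   # 95% confidence interval for the correlation
```

With many pitches, even a very small correlation can be statistically distinguishable from zero while remaining too weak to be useful as a single predictor, so both the estimate and its interval are worth reporting.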

Pitch Location Data

The variables examined below all relate to where, in reference to the strike zone, the ball crossed home plate. These variables are important as pitches thrown right down the center are easier to hit than pitches that are too high, low, or skewed to either the left or right side.

First, looking at the relationship between the strike zone and xWOBA below, perhaps unsurprisingly, we see a high density of high xWOBA values, or hard-hit balls, in zones 4, 5, and 6. These three zones make up the middle of the strike zone, and thus pitches delivered there are likely to result in hard contact. In magnitude, the correlation we observe between zone and xWOBA is one of the higher values we get. While -0.17 is still ultimately quite low, the inherently stochastic nature of baseball might make it hard to get very high correlation values. However, the trend displayed below also suggested to us that we should create our own location variable, as it is hard to describe the relationship with a single linear trend when xWOBA peaks at zones 4-6 and then tails off on either end.

ggplot(final_pitch_data, aes( x=zone,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$zone,final_pitch_data$xwoba)
## [1] -0.1719035
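
Because zone is a label rather than a quantity, a per-zone summary may describe the pattern more faithfully than a single correlation coefficient. The following is a minimal sketch on simulated data, assuming zones are coded 1 to 9 inside the strike zone and 11 to 14 outside it; `toy_zones` and its effect sizes are invented for illustration.

```r
# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(2)
n <- 5000
zone_levels <- c(1:9, 11:14)   # assumed zone coding
toy_zones <- data.frame(zone = sample(zone_levels, n, replace = TRUE))
# Invented pattern: higher expected xwoba in zones 4-6, lower elsewhere
toy_zones$xwoba <- ifelse(toy_zones$zone %in% 4:6, 0.45, 0.30) +
  rnorm(n, sd = 0.2)

# Mean xwoba within each zone, treating zone as a category
zone_means <- tapply(toy_zones$xwoba, factor(toy_zones$zone), mean)
round(zone_means, 3)
```

A table like this makes a peak in the middle zones visible directly, without forcing the zone numbers onto a linear scale.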

We next looked at the X (horizontal) and Z (vertical) locations at which the pitch passes home plate. The trend is striking in both: the high xWOBA values consistently fall within the middle region of each predictor. This is reasonable, as pitches that cross the center of the plate are more likely to be hit hard by the batter.

Because the X coordinates of the ball as it passes home plate are recorded with both positive (to the right) and negative (to the left) values, we tried an absolute value transformation, which assumes that the direction in which the pitch skews matters less than how far it skews. This transformation gave us a much stronger correlation of -0.15 (compared with -0.01 for the raw coordinate), one of our stronger correlations thus far.

We also transformed the Z coordinates of the pitch, as we observed that high xWOBA values tended to aggregate in the center of the distribution. Thus we took the absolute value of the difference between each point and the mean Z coordinate in order to represent how far from the average each pitch falls. This strengthened the correlation from 0.03 to -0.15.

Overall, location data seems very promising.

ggplot(final_pitch_data, aes( x=plate_x,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$plate_x,final_pitch_data$xwoba)
## [1] -0.01139309
cor(abs(final_pitch_data$plate_x),final_pitch_data$xwoba)
## [1] -0.1543679
ggplot(final_pitch_data, aes( x=plate_z,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$plate_z,final_pitch_data$xwoba)
## [1] 0.02812215
cor(abs(final_pitch_data$plate_z-mean(final_pitch_data$plate_z)),final_pitch_data$xwoba)
## [1] -0.1476836
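
The reason these transformations help can be seen on simulated data: when the response peaks in the middle of a predictor's range, the two halves cancel in a linear correlation, while distance from the center is monotonically related to the response. The sketch below is illustrative only; `toy_loc`, the bell-shaped response, and the combined `dist_center` variable are assumptions and not part of the original analysis.

```r
# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(3)
n <- 5000
toy_loc <- data.frame(plate_x = rnorm(n, mean = 0, sd = 1),
                      plate_z = rnorm(n, mean = 2.5, sd = 1))
z_center <- mean(toy_loc$plate_z)

# Invented response that peaks when the pitch is centered in both dimensions
toy_loc$xwoba <- exp(-(toy_loc$plate_x^2 + (toy_loc$plate_z - z_center)^2)) +
  rnorm(n, sd = 0.1)

# Raw coordinate: left and right misses cancel out
raw_cor <- cor(toy_loc$plate_x, toy_loc$xwoba)
# Absolute value: distance from the horizontal center
abs_cor <- cor(abs(toy_loc$plate_x), toy_loc$xwoba)
# Hypothetical combined variable: straight-line distance from the center
toy_loc$dist_center <- sqrt(toy_loc$plate_x^2 + (toy_loc$plate_z - z_center)^2)
dist_cor <- cor(toy_loc$dist_center, toy_loc$xwoba)

c(raw = raw_cor, abs = abs_cor, dist = dist_cor)
```

In this toy setting the raw correlation is near zero while both distance-based versions are clearly negative, which mirrors the direction of the change seen in `plate_x` and `plate_z` above.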

Pitch Movement

Pfx_x and pfx_z are predictors that indicate the horizontal and vertical movement of the ball in flight, respectively. For example, a negative pfx_z value suggests that the ball dropped substantially, while a negative pfx_x value suggests leftward movement.

Neither of the two predictors appears to be highly correlated with xWOBA. Surprisingly, an absolute value transformation of either predictor did not increase the correlation. We might have to consider an interaction term between pitch type and vertical/horizontal movement: it is possible that, while movement in isolation is not important, certain pitch types benefit more from a high or low degree of movement.

ggplot(final_pitch_data, aes( x=pfx_x,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$pfx_x,final_pitch_data$xwoba)
## [1] -0.01286333
ggplot(final_pitch_data, aes( x=pfx_z,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$pfx_z,final_pitch_data$xwoba)
## [1] 0.0239281
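
One way to probe this before fitting a model is to compute the correlation separately within each pitch type. The sketch below uses simulated data in which movement helps one pitch type and hurts another, so the pooled correlation is close to zero; the column name `pitch_type`, the two type labels, and all effect sizes are assumptions made for illustration.

```r
# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(4)
n <- 4000
toy_move <- data.frame(
  pitch_type = rep(c("FF", "CU"), each = n / 2),
  pfx_z = rnorm(n)
)
# Invented effect: opposite slopes for the two pitch types
slope <- ifelse(toy_move$pitch_type == "FF", -0.05, 0.05)
toy_move$xwoba <- 0.32 + slope * toy_move$pfx_z + rnorm(n, sd = 0.1)

# Correlation ignoring pitch type
pooled_cor <- cor(toy_move$pfx_z, toy_move$xwoba)
# Correlation computed separately for each pitch type
by_type_cor <- sapply(split(toy_move, toy_move$pitch_type),
                      function(d) cor(d$pfx_z, d$xwoba))

pooled_cor
by_type_cor
```

If the real data showed within-type correlations of differing sign or size, that would support including a pitch type by movement interaction term.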

Pitch Velocity and Acceleration

Here, we look at the velocity of the pitch in the X, Y and Z dimensions while also looking at the acceleration or rate of change of velocity.

First, looking at velocity, there does not appear to be an obvious relationship between velocity in any direction and xWOBA. Seeing as velocity in the X direction was recorded with both positive and negative values, we tried an absolute value transformation, which increased the magnitude of the correlation, albeit by a very small amount (from 0.002 to -0.007). We also used the aforementioned distance-from-average transformation on velocity in the Z dimension, which strengthened the correlation from -0.007 to -0.05, still a weak relationship.

We might have to consider interaction terms between pitch type and velocity in order to make more use of these predictors.

ggplot(final_pitch_data, aes( x=vx0,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$vx0,final_pitch_data$xwoba)
## [1] 0.001823748
cor(abs(final_pitch_data$vx0),final_pitch_data$xwoba)
## [1] -0.006891287
ggplot(final_pitch_data, aes( x=vy0,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$vy0,final_pitch_data$xwoba)
## [1] -0.02448057
ggplot(final_pitch_data, aes( x=vz0,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$vz0,final_pitch_data$xwoba)
## [1] -0.007226455
cor(abs(final_pitch_data$vz0-mean(final_pitch_data$vz0)),final_pitch_data$xwoba)
## [1] -0.04977373

Next we looked at acceleration. Once again we don't see any obvious trends.

One interesting thing we noted was that there seem to be more high xWOBA values in the middle of the distribution for acceleration in Y. We used the previously mentioned transformation to quantify how far these points fall from the average. This transformation increased the magnitude of the correlation for acceleration in Y, and for acceleration in Z as well, but once again only to a very small degree.

Even though this transformation did not accomplish what we had wanted it to here, it is a potentially useful transformation for other predictors.

ggplot(final_pitch_data, aes( x=ax,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$ax,final_pitch_data$xwoba)
## [1] -0.01280907
ggplot(final_pitch_data, aes( x=ay,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$ay,final_pitch_data$xwoba)
## [1] 0.008600517
cor(abs(final_pitch_data$ay-mean(final_pitch_data$ay)),final_pitch_data$xwoba)
## [1] -0.02370901
ggplot(final_pitch_data, aes( x=az,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$az,final_pitch_data$xwoba)
## [1] 0.02150236
cor(abs(final_pitch_data$az-mean(final_pitch_data$az)),final_pitch_data$xwoba)
## [1] -0.03188083
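
Since the same distance-from-average transformation is applied to several predictors, it may be convenient to wrap it in a helper and tabulate the raw and transformed correlations side by side. This is a sketch only: `centered_abs_cor` is a hypothetical helper, and `toy_kin` is simulated data standing in for `final_pitch_data`.

```r
# Hypothetical helper: correlation of the response with the raw predictor
# and with the predictor's absolute deviation from its own mean
centered_abs_cor <- function(data, predictor, response = "xwoba") {
  x <- data[[predictor]]
  y <- data[[response]]
  c(raw = cor(x, y), centered_abs = cor(abs(x - mean(x)), y))
}

# Simulated stand-in for the real data (hypothetical values, illustration only)
set.seed(5)
n <- 3000
toy_kin <- data.frame(vz0 = rnorm(n, mean = -5, sd = 2),
                      ay  = rnorm(n, mean = 27, sd = 3),
                      az  = rnorm(n, mean = -23, sd = 8))
# Invented response: only the distance of vz0 from -5 matters
toy_kin$xwoba <- 0.35 - 0.02 * abs(toy_kin$vz0 + 5) + rnorm(n, sd = 0.1)

# One row per predictor, one column per version of the correlation
cor_table <- t(sapply(c("vz0", "ay", "az"), centered_abs_cor, data = toy_kin))
round(cor_table, 3)
```

Keeping both columns in one table makes it easy to see which predictors gain from the transformation and which do not.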

Release Position of the Pitch

As one might expect, the position at which the ball is released by the pitcher does not seem to correlate with the outcome of the pitch. Nevertheless, the two release position predictors (horizontal and vertical) were worth investigating. It seems that the more important location data is where the ball crosses home plate, not necessarily where the ball leaves the pitcher's hand.

ggplot(final_pitch_data, aes( x=release_pos_x,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$release_pos_x,final_pitch_data$xwoba)
## (value not available: the output originally shown here, 0.02418302, was the release_speed correlation)
ggplot(final_pitch_data, aes( x=release_pos_z,y=xwoba)) +
  geom_point(alpha=0.2,color="dodgerblue",fill="dodgerblue")+
  theme_classic()

cor(final_pitch_data$release_pos_z,final_pitch_data$xwoba)
## [1] 0.02897294

Potential Interaction Terms